Overview

Dataset statistics

Number of variables: 10
Number of observations: 20640
Missing cells: 207
Missing cells (%): 0.1%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 2.7 MiB
Average record size in memory: 137.1 B
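
The two missing-value percentages in this report (0.1% here and 1.0% for total_bedrooms) come from the same 207 cells measured against different denominators. A minimal sketch of that arithmetic, using only counts quoted in the report; the variable names are illustrative and are not pandas-profiling identifiers:

```python
# Counts copied from the report.
n_variables = 10
n_observations = 20640
missing_cells = 207  # the report attributes all of them to total_bedrooms

total_cells = n_variables * n_observations

# Share of missing cells over the whole table ("Missing cells (%)").
missing_cells_pct = 100 * missing_cells / total_cells

# Share of missing values within the single affected column.
missing_column_pct = 100 * missing_cells / n_observations

print(f"whole dataset:  {missing_cells_pct:.1f}%")
print(f"total_bedrooms: {missing_column_pct:.1f}%")
```

Rounded to one decimal place, the first figure matches the overview and the second matches the per-variable warning.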

Variable types

NUM: 9
CAT: 1

Reproduction

Analysis started: 2020-03-22 19:26:00.664504
Analysis finished: 2020-03-22 19:26:46.263198
Version: pandas-profiling v2.5.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

latitude is highly correlated with longitude (High Correlation)
longitude is highly correlated with latitude (High Correlation)
total_bedrooms is highly correlated with total_rooms and 1 other field (High Correlation)
total_rooms is highly correlated with total_bedrooms and 1 other field (High Correlation)
households is highly correlated with total_rooms and 2 other fields (High Correlation)
population is highly correlated with households (High Correlation)
total_bedrooms has 207 (1.0%) missing values (Missing)

Variables

longitude
Real number (ℝ)

HIGH CORRELATION
Distinct count: 844
Unique (%): 4.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: -119.56970445736432
Minimum: -124.35
Maximum: -114.31
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: -124.35
5-th percentile: -122.47
Q1: -121.8
Median: -118.49
Q3: -118.01
95-th percentile: -117.08
Maximum: -114.31
Range: 10.04
Interquartile range (IQR): 3.79

Descriptive statistics

Standard deviation: 2.003531724
Coefficient of variation (CV): -0.01675618195
Kurtosis: -1.330152366
Mean: -119.5697045
Median Absolute Deviation (MAD): 1.830205859
Skewness: -0.297801208
Sum: -2467918.7
Variance: 4.014139367
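
Several of the figures above are simple functions of the others. The sketch below recomputes the range, IQR, coefficient of variation and variance from the longitude values printed in this section. It assumes the usual definitions (range = max - min, IQR = Q3 - Q1, CV = standard deviation / mean, variance = standard deviation squared); the variable names are illustrative.

```python
import math

# Values copied from the longitude section of the report.
minimum, maximum = -124.35, -114.31
q1, q3 = -121.8, -118.01
mean = -119.5697045
std = 2.003531724

value_range = maximum - minimum  # Range
iqr = q3 - q1                    # Interquartile range (IQR)
cv = std / mean                  # Coefficient of variation (CV)
variance = std ** 2              # Variance

print(round(value_range, 2), round(iqr, 2))
print(round(cv, 6), round(variance, 4))

# Compare against the reported figures, allowing for rounding in the report.
assert math.isclose(value_range, 10.04, abs_tol=1e-9)
assert math.isclose(iqr, 3.79, abs_tol=1e-9)
assert math.isclose(cv, -0.01675618195, abs_tol=1e-9)
assert math.isclose(variance, 4.014139367, abs_tol=1e-6)
```

The CV is negative here only because the mean longitude is negative, so for this variable it says little about relative spread.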
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[-124.35 -124.185 -124.075 -123.225 -123.195 ... -115.485 -115.345 -114.675 -114.555 -114.31 ], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
-118.31                162   0.8%
-118.3                 160   0.8%
-118.29                148   0.7%
-118.27                144   0.7%
-118.32                142   0.7%
-118.28                141   0.7%
-118.35                140   0.7%
-118.36                138   0.7%
-118.19                135   0.7%
-118.25                128   0.6%
Other values (834)   19202   93.0%

Minimum 5 values

Value                Count   Frequency (%)
-124.35                  1   < 0.1%
-124.3                   2   < 0.1%
-124.27                  1   < 0.1%
-124.26                  1   < 0.1%
-124.25                  1   < 0.1%

Maximum 5 values

Value                Count   Frequency (%)
-114.31                  1   < 0.1%
-114.47                  1   < 0.1%
-114.49                  1   < 0.1%
-114.55                  1   < 0.1%
-114.56                  1   < 0.1%

latitude
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 862
Unique (%): 4.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 35.63186143410853
Minimum: 32.54
Maximum: 41.95
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 32.54
5-th percentile: 32.82
Q1: 33.93
Median: 34.26
Q3: 37.71
95-th percentile: 38.96
Maximum: 41.95
Range: 9.41
Interquartile range (IQR): 3.78

Descriptive statistics

Standard deviation: 2.135952397
Coefficient of variation (CV): 0.05994501302
Kurtosis: -1.117759781
Mean: 35.63186143
Median Absolute Deviation (MAD): 1.975024291
Skewness: 0.4659530037
Sum: 735441.62
Variance: 4.562292644
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[32.54 32.555 32.665 32.735 32.815 ... 40.605 40.765 40.805 40.96 41.95 ], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
34.06                  244   1.2%
34.05                  236   1.1%
34.08                  234   1.1%
34.07                  231   1.1%
34.04                  221   1.1%
34.09                  212   1.0%
34.02                  208   1.0%
34.1                   203   1.0%
34.03                  193   0.9%
33.93                  181   0.9%
Other values (852)   18477   89.5%

Minimum 5 values

Value                Count   Frequency (%)
32.54                    1   < 0.1%
32.55                    3   < 0.1%
32.56                   10   < 0.1%
32.57                   18   0.1%
32.58                   26   0.1%

Maximum 5 values

Value                Count   Frequency (%)
41.95                    2   < 0.1%
41.92                    1   < 0.1%
41.88                    1   < 0.1%
41.86                    3   < 0.1%
41.84                    1   < 0.1%

housing_median_age
Real number (ℝ≥0)

Distinct count: 52
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 28.639486434108527
Minimum: 1.0
Maximum: 52.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 8
Q1: 18
Median: 29
Q3: 37
95-th percentile: 52
Maximum: 52
Range: 51
Interquartile range (IQR): 19

Descriptive statistics

Standard deviation: 12.58555761
Coefficient of variation (CV): 0.4394477408
Kurtosis: -0.8006288536
Mean: 28.63948643
Median Absolute Deviation (MAD): 10.55153852
Skewness: 0.0603306376
Sum: 591119
Variance: 158.3962604
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 1.5 3.5 9.5 13.5 ... 46.5 48.5 50.5 51.5 52. ], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
52                    1273   6.2%
36                     862   4.2%
35                     824   4.0%
16                     771   3.7%
17                     698   3.4%
34                     689   3.3%
26                     619   3.0%
33                     615   3.0%
18                     570   2.8%
25                     566   2.7%
Other values (42)    13153   63.7%

Minimum 5 values

Value                Count   Frequency (%)
1                        4   < 0.1%
2                       58   0.3%
3                       62   0.3%
4                      191   0.9%
5                      244   1.2%

Maximum 5 values

Value                Count   Frequency (%)
52                    1273   6.2%
51                      48   0.2%
50                     136   0.7%
49                     134   0.6%
48                     177   0.9%

total_rooms
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 5926
Unique (%): 28.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2635.7630813953488
Minimum: 2.0
Maximum: 39320.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 2
5-th percentile: 620.95
Q1: 1447.75
Median: 2127
Q3: 3148
95-th percentile: 6213.2
Maximum: 39320
Range: 39318
Interquartile range (IQR): 1700.25

Descriptive statistics

Standard deviation: 2181.615252
Coefficient of variation (CV): 0.8276977802
Kurtosis: 32.630927
Mean: 2635.763081
Median Absolute Deviation (MAD): 1344.462236
Skewness: 4.147343451
Sum: 54402150
Variance: 4759445.106
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[2.00000e+00 3.20500e+02 6.73500e+02 9.27500e+02 1.12450e+03 ... 1.02815e+04 1.33025e+04 1.77790e+04 2.20580e+04 3.93200e+04], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
1527                    18   0.1%
1613                    17   0.1%
1582                    17   0.1%
2127                    16   0.1%
1703                    15   0.1%
1471                    15   0.1%
2053                    15   0.1%
1722                    15   0.1%
1607                    15   0.1%
1717                    15   0.1%
Other values (5916)  20482   99.2%

Minimum 5 values

Value                Count   Frequency (%)
2                        1   < 0.1%
6                        1   < 0.1%
8                        1   < 0.1%
11                       1   < 0.1%
12                       1   < 0.1%

Maximum 5 values

Value                Count   Frequency (%)
39320                    1   < 0.1%
37937                    1   < 0.1%
32627                    1   < 0.1%
32054                    1   < 0.1%
30450                    1   < 0.1%

total_bedrooms
Real number (ℝ≥0)

HIGH CORRELATION
MISSING
Distinct count: 1923
Unique (%): 9.4%
Missing: 207
Missing (%): 1.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 537.8705525375618
Minimum: 1.0
Maximum: 6445.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 137
Q1: 296
Median: 435
Q3: 647
95-th percentile: 1275.4
Maximum: 6445
Range: 6444
Interquartile range (IQR): 351

Descriptive statistics

Standard deviation: 421.3850701
Coefficient of variation (CV): 0.7834321252
Kurtosis: 21.98557506
Mean: 537.8705525
Median Absolute Deviation (MAD): 270.9236064
Skewness: 3.459546332
Sum: 10990309
Variance: 177565.3773
Histogram with fixed size bins (bins=10)

Common values

Value                Count   Frequency (%)
280                     55   0.3%
331                     51   0.2%
345                     50   0.2%
393                     49   0.2%
343                     49   0.2%
394                     48   0.2%
328                     48   0.2%
348                     48   0.2%
272                     47   0.2%
309                     47   0.2%
Other values (1913)  19941   96.6%
(Missing)              207   1.0%

Minimum 5 values

Value                Count   Frequency (%)
1                        1   < 0.1%
2                        2   < 0.1%
3                        5   < 0.1%
4                        7   < 0.1%
5                        6   < 0.1%

Maximum 5 values

Value                Count   Frequency (%)
6445                     1   < 0.1%
6210                     1   < 0.1%
5471                     1   < 0.1%
5419                     1   < 0.1%
5290                     1   < 0.1%

population
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 3888
Unique (%): 18.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1425.4767441860465
Minimum: 3.0
Maximum: 35682.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 3
5-th percentile: 348
Q1: 787
Median: 1166
Q3: 1725
95-th percentile: 3288
Maximum: 35682
Range: 35679
Interquartile range (IQR): 938

Descriptive statistics

Standard deviation: 1132.462122
Coefficient of variation (CV): 0.7944444737
Kurtosis: 73.55311639
Mean: 1425.476744
Median Absolute Deviation (MAD): 714.2372769
Skewness: 4.935858227
Sum: 29421840
Variance: 1282470.457
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[3.0000e+00 2.2650e+02 3.5250e+02 4.8050e+02 5.9650e+02 ... 5.0375e+03 7.6865e+03 9.9075e+03 1.3062e+04 3.5682e+04], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
891                     25   0.1%
761                     24   0.1%
1227                    24   0.1%
850                     24   0.1%
1052                    24   0.1%
825                     23   0.1%
999                     22   0.1%
782                     22   0.1%
1005                    22   0.1%
781                     21   0.1%
Other values (3878)  20409   98.9%

Minimum 5 values

Value                Count   Frequency (%)
3                        1   < 0.1%
5                        1   < 0.1%
6                        1   < 0.1%
8                        4   < 0.1%
9                        2   < 0.1%

Maximum 5 values

Value                Count   Frequency (%)
35682                    1   < 0.1%
28566                    1   < 0.1%
16305                    1   < 0.1%
16122                    1   < 0.1%
15507                    1   < 0.1%

households
Real number (ℝ≥0)

HIGH CORRELATION
Distinct count: 1815
Unique (%): 8.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 499.5396802325581
Minimum: 1.0
Maximum: 6082.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 125
Q1: 280
Median: 409
Q3: 605
95-th percentile: 1162
Maximum: 6082
Range: 6081
Interquartile range (IQR): 325

Descriptive statistics

Standard deviation: 382.3297528
Coefficient of variation (CV): 0.7653641301
Kurtosis: 22.05798806
Mean: 499.5396802
Median Absolute Deviation (MAD): 247.1953674
Skewness: 3.410437712
Sum: 10310499
Variance: 146176.0399
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[1.0000e+00 6.6500e+01 1.2950e+02 1.6850e+02 2.0350e+02 ... 1.9275e+03 2.4495e+03 2.9035e+03 4.0420e+03 6.0820e+03], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
306                     57   0.3%
386                     56   0.3%
335                     56   0.3%
282                     55   0.3%
429                     54   0.3%
375                     53   0.3%
284                     51   0.2%
297                     51   0.2%
362                     50   0.2%
380                     50   0.2%
Other values (1805)  20107   97.4%

Minimum 5 values

Value                Count   Frequency (%)
1                        1   < 0.1%
2                        3   < 0.1%
3                        4   < 0.1%
4                        4   < 0.1%
5                        7   < 0.1%

Maximum 5 values

Value                Count   Frequency (%)
6082                     1   < 0.1%
5358                     1   < 0.1%
5189                     1   < 0.1%
5050                     1   < 0.1%
4930                     1   < 0.1%

median_income
Real number (ℝ≥0)

Distinct count: 12928
Unique (%): 62.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.8706710029069766
Minimum: 0.4999
Maximum: 15.0001
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 0.4999
5-th percentile: 1.60057
Q1: 2.5634
Median: 3.5348
Q3: 4.74325
95-th percentile: 7.300305
Maximum: 15.0001
Range: 14.5002
Interquartile range (IQR): 2.17985

Descriptive statistics

Standard deviation: 1.899821718
Coefficient of variation (CV): 0.4908249026
Kurtosis: 4.952524102
Mean: 3.870671003
Median Absolute Deviation (MAD): 1.401613645
Skewness: 1.646656702
Sum: 79890.6495
Variance: 3.60932256
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 0.4999 0.54275 0.68 0.89005 1.124 ... 8.87675 11.11745 13.5359 15.00005 15.0001 ], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
3.125                   49   0.2%
15.0001                 49   0.2%
2.875                   46   0.2%
4.125                   44   0.2%
2.625                   44   0.2%
3.875                   41   0.2%
3                       38   0.2%
3.375                   38   0.2%
3.625                   37   0.2%
4                       37   0.2%
Other values (12918) 20217   98.0%

Minimum 5 values

Value                Count   Frequency (%)
0.4999                  12   0.1%
0.536                   10   < 0.1%
0.5495                   1   < 0.1%
0.6433                   1   < 0.1%
0.6775                   1   < 0.1%

Maximum 5 values

Value                Count   Frequency (%)
15.0001                 49   0.2%
15                       2   < 0.1%
14.9009                  1   < 0.1%
14.5833                  1   < 0.1%
14.4219                  1   < 0.1%

median_house_value
Real number (ℝ≥0)

Distinct count: 3842
Unique (%): 18.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 206855.81690891474
Minimum: 14999.0
Maximum: 500001.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.4 KiB

Quantile statistics

Minimum: 14999
5-th percentile: 66200
Q1: 119600
Median: 179700
Q3: 264725
95-th percentile: 489810
Maximum: 500001
Range: 485002
Interquartile range (IQR): 145125

Descriptive statistics

Standard deviation: 115395.6159
Coefficient of variation (CV): 0.55785531
Kurtosis: 0.3278702429
Mean: 206855.8169
Median Absolute Deviation (MAD): 91170.43994
Skewness: 0.9777632739
Sum: 4269504061
Variance: 1.331614816e+10
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 14999. 37100. 42300. 47450. 54950. ... 449150. 450200. 499550. 500000.5 500001. ], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
500001                 965   4.7%
137500                 122   0.6%
162500                 117   0.6%
112500                 103   0.5%
187500                  93   0.5%
225000                  92   0.4%
350000                  79   0.4%
87500                   78   0.4%
275000                  65   0.3%
150000                  64   0.3%
Other values (3832)  18862   91.4%

Minimum 5 values

Value                Count   Frequency (%)
14999                    4   < 0.1%
17500                    1   < 0.1%
22500                    4   < 0.1%
25000                    1   < 0.1%
26600                    1   < 0.1%

Maximum 5 values

Value                Count   Frequency (%)
500001                 965   4.7%
500000                  27   0.1%
499100                   1   < 0.1%
499000                   1   < 0.1%
498800                   1   < 0.1%

ocean_proximity
Categorical

Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 161.4 KiB

Common values

Value                Count   Frequency (%)
<1H OCEAN             9136   44.3%
INLAND                6551   31.7%
NEAR OCEAN            2658   12.9%
NEAR BAY              2290   11.1%
ISLAND                   5   < 0.1%

Length

Max length: 10
Mean length: 8.064922481
Min length: 6

Unicode category

Value                Count   Frequency (%)
Uppercase_Letter        13   81.2%
Math_Symbol              1   6.2%
Decimal_Number           1   6.2%
Space_Separator          1   6.2%

Unicode script

Value                Count   Frequency (%)
Latin                   13   81.2%
Common                   3   18.8%

Unicode block

Value                Count   Frequency (%)
ASCII                   16   100.0%
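
The length and character figures for ocean_proximity can be recomputed from the five labels and their counts. The sketch below does this with the standard library. It assumes that the character tables count distinct characters across all labels, which is an inference from the totals (13 + 1 + 1 + 1 = 16) and not something the report states; the variable names are illustrative.

```python
import unicodedata

# Category counts copied from the frequency table for ocean_proximity.
counts = {
    "<1H OCEAN": 9136,
    "INLAND": 6551,
    "NEAR OCEAN": 2658,
    "NEAR BAY": 2290,
    "ISLAND": 5,
}

n = sum(counts.values())
lengths = [len(label) for label in counts]

# Mean label length, weighted by how often each label occurs.
mean_length = sum(len(label) * c for label, c in counts.items()) / n

# Distinct characters grouped by Unicode general category
# (Lu = Uppercase_Letter, Sm = Math_Symbol,
#  Nd = Decimal_Number, Zs = Space_Separator).
distinct_chars = set("".join(counts))
by_category = {}
for ch in distinct_chars:
    by_category.setdefault(unicodedata.category(ch), set()).add(ch)

print(n, min(lengths), max(lengths), round(mean_length, 9))
print({cat: len(chars) for cat, chars in sorted(by_category.items())})
```

Under that assumption the 16 distinct characters split into 13 uppercase letters plus "<", "1" and the space, which is consistent with the tables.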
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
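
As an illustration of that definition, here is a minimal standard-library sketch. The function name `pearson_r` is hypothetical and this is not how the report itself computes the matrix; the input values are total_rooms and households from the first ten rows of the Sample section.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator.
    return cov / math.sqrt(var_x * var_y)

# total_rooms and households from the first ten sample rows.
total_rooms = [880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555, 3549]
households = [126, 1138, 177, 219, 259, 193, 514, 647, 595, 714]

r = pearson_r(total_rooms, households)
# Shifting and positively rescaling one variable leaves r unchanged.
r_rescaled = pearson_r([2 * x + 100 for x in total_rooms], households)
print(r, r_rescaled)
```

Ten rows are far too few to reproduce the report's coefficient; the example only shows the mechanics and the invariance under location and scale.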

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
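
A minimal sketch of that recipe follows: rank both variables, then apply Pearson's formula to the ranks. Giving tied values the average of their positions is a common convention and is assumed here; the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Pearson's r computed on the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# A monotonic but nonlinear relationship: y = x cubed.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y), pearson_r(x, y))
```

For this data the ranks of x and y are identical, so ρ reaches 1 while r stays below 1 because the relationship is not linear.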

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
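
The pair-counting rule can be sketched directly. This is the plain variant described above (often called tau-a), where tied pairs count as neither concordant nor discordant. Statistical libraries commonly default to a tie-corrected variant (tau-b), so results may differ when the data has ties. The function name is hypothetical.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both variables move in the same direction
        elif sign < 0:
            discordant += 1  # the variables move in opposite directions
        # Tied pairs (sign == 0) count as neither.
    n = len(xs)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Six pairs in total: five concordant, one discordant (the swapped 3 and 2).
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

The double loop over pairs is quadratic in the number of observations, which is fine for illustration but slow for a dataset of this size.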

Missing values

Sample

First rows

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
    0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0  NEAR BAY
    1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0  NEAR BAY
    2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0  NEAR BAY
    3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0  NEAR BAY
    4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0  NEAR BAY
    5    -122.25     37.85                52.0        919.0           213.0       413.0       193.0         4.0368            269700.0  NEAR BAY
    6    -122.25     37.84                52.0       2535.0           489.0      1094.0       514.0         3.6591            299200.0  NEAR BAY
    7    -122.25     37.84                52.0       3104.0           687.0      1157.0       647.0         3.1200            241400.0  NEAR BAY
    8    -122.26     37.84                42.0       2555.0           665.0      1206.0       595.0         2.0804            226700.0  NEAR BAY
    9    -122.25     37.84                52.0       3549.0           707.0      1551.0       714.0         3.6912            261100.0  NEAR BAY

Last rows

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
20630    -121.32     39.29                11.0       2640.0           505.0      1257.0       445.0         3.5673            112000.0  INLAND
20631    -121.40     39.33                15.0       2655.0           493.0      1200.0       432.0         3.5179            107200.0  INLAND
20632    -121.45     39.26                15.0       2319.0           416.0      1047.0       385.0         3.1250            115600.0  INLAND
20633    -121.53     39.19                27.0       2080.0           412.0      1082.0       382.0         2.5495             98300.0  INLAND
20634    -121.56     39.27                28.0       2332.0           395.0      1041.0       344.0         3.7125            116800.0  INLAND
20635    -121.09     39.48                25.0       1665.0           374.0       845.0       330.0         1.5603             78100.0  INLAND
20636    -121.21     39.49                18.0        697.0           150.0       356.0       114.0         2.5568             77100.0  INLAND
20637    -121.22     39.43                17.0       2254.0           485.0      1007.0       433.0         1.7000             92300.0  INLAND
20638    -121.32     39.43                18.0       1860.0           409.0       741.0       349.0         1.8672             84700.0  INLAND
20639    -121.24     39.37                16.0       2785.0           616.0      1387.0       530.0         2.3886             89400.0  INLAND